Problem Statement

Breast cancer is one of the leading causes of cancer-related deaths among women. This year, an estimated 42,170 women will die from breast cancer in the U.S. (according to www.nationalbreastcancer.org). Using prediction techniques on genetic data has the potential to give accurate estimates of survival time and to help prevent unnecessary surgical and treatment procedures.

“The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK Project which contains targeted sequencing data of 1,980 primary breast cancer samples. Clinical and genomic data was downloaded from cBioPortal.”

The dataset was collected by Professor Carlos Caldas from Cambridge Research Institute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada. Therefore, our population is women who have visited this Cancer Centre.

Data Description

The dataset was obtained from ‘Kaggle’ and contains 693 variables and 1,904 observations. The file includes 31 clinical attributes, mRNA-level z-scores for 331 genes, and mutation data for 175 genes for 1,904 breast cancer patients. The data was originally collected by Professor Carlos Caldas from Cambridge Research Institute and Professor Sam Aparicio from the British Columbia Cancer Centre in Canada. This data is representative of my population because it includes patients who have been examined for breast cancer at varying levels recently in the year 2020. For the purpose of this analysis, I reduce the number of variables to the 31 most relevant to this modeling. Rows containing ‘NA’ values were then removed so that observations with missing data are excluded from the dataset.
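The variable reduction and ‘NA’ removal described above can be sketched in a few lines of R. This is only an illustration under assumptions: the toy data frame `raw`, its values, and the column `gene_x` are invented, and only four of the retained columns are shown. The name `newdat` matches the data frame used in the model calls later in this report.

```r
# Toy stand-in for the Kaggle file; all values are invented for illustration.
raw <- data.frame(
  overall_survival_months = c(140.5, 84.6, NA, 47.7, 164.9),
  age_at_diagnosis        = c(75.6, 43.2, 48.9, 47.7, NA),
  tumor_size              = c(22, 10, 15, NA, 40),
  tumor_stage             = c(2, 1, 2, 2, 2),
  gene_x                  = c(0.3, -1.2, 0.8, 0.1, -0.4)  # hypothetical mRNA z-score column
)

# Keep only the columns of interest (the report keeps 31 clinical attributes).
keep <- c("overall_survival_months", "age_at_diagnosis", "tumor_size", "tumor_stage")
newdat <- raw[, keep]

# Drop every row that has a missing value in any retained column.
newdat <- na.omit(newdat)

nrow(newdat)                   # rows remaining after removal
mean(newdat$age_at_diagnosis)  # average age at diagnosis in the cleaned data
```

Note that `na.omit()` removes a row if any retained column is missing, so selecting the columns first avoids discarding patients because of gaps in variables that are not used.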


Data Summarization

  • The response variable is ‘overall_survival_months’ which represents the duration from the time of the intervention to death in months.

  • The predictor variables are ‘age_at_diagnosis’: age of the patient at diagnosis time; ‘cellularity’: cancer cellularity post chemotherapy, which refers to the amount of tumor cells in the specimen and their arrangement into clusters, classified as “Low”, “Moderate”, or “High”; ‘tumor_size’: tumor size measured by imaging techniques in centimeters; and ‘tumor_stage’: the stage of the cancer based on the involvement of surrounding structures, lymph nodes and distant spread on a scale of 1-4.

Average Age at Diagnosis
## [1] 60.33846

The average age of women diagnosed with breast cancer at the British Columbia Cancer Centre in Canada is approximately 60 years.


Scatterplots of Predictors



The type of breast surgery variable is an object type that describes:

  • MASTECTOMY: which refers to a surgery to remove all breast tissue from a breast as a way to treat or prevent breast cancer.

  • BREAST CONSERVING: which refers to a surgery where only the part of the breast that has cancer is removed.

The cellularity variable is an object that describes the cancer cellularity post chemotherapy, which refers to the amount of tumor cells in the specimen and their arrangement into clusters.

The higher the cellularity found in the breast tissue, the more likely the mass is to be considered malignant and in need of removal. Accordingly, a larger share of patients have all of the breast tissue removed by mastectomy as the level of cellularity increases.


Simple Linear Regression

\[ \operatorname{overall\_survival\_months} = \alpha + \beta_{1}(\operatorname{tumor\_size}) + \epsilon \]


The dependent variable here is “overall survival months” and the explanatory variable is “tumor size”. The trend in the scatterplots shows that the smaller the tumor, the more months the patient survives. This could be because the cancer becomes more aggressive as it grows in size or metastasizes.
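A minimal sketch of fitting this simple linear model in R follows. The data are simulated, not METABRIC, and the true intercept, slope and noise level are assumptions chosen only so that the negative trend is visible. The name `newmodel` is taken from the AIC table later in this report.

```r
set.seed(1)

# Simulated data, not METABRIC: survival decreases with tumor size plus noise.
n <- 200
tumor_size <- runif(n, min = 1, max = 60)
overall_survival_months <- 150 - 0.8 * tumor_size + rnorm(n, sd = 30)
sim <- data.frame(tumor_size, overall_survival_months)

# Fit overall_survival_months = alpha + beta_1 * tumor_size + epsilon.
newmodel <- lm(overall_survival_months ~ tumor_size, data = sim)

coef(newmodel)                   # estimates of alpha and beta_1
summary(newmodel)$adj.r.squared  # adjusted R-squared of the fit
```

A negative estimate for `tumor_size` corresponds to the trend described above: larger tumors go with fewer survival months.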

Outliers 130, 702 and 717
outlier  row  overall_survival_months  overall_survival  pr_status  radio_therapy  X3.gene_classifier_subtype  tumor_size
130      163  337.0333                 1                 Positive   0              ER+/HER2- High Prolif       14
702      784  351                      0                 Positive   1                                          22
717      808  335.7333                 0                 Negative   0                                          20

The maximum overall survival among the breast cancer patients in this dataset is 351 months and the minimum is 0.1 months.

The largest tumor size among the breast cancer patients in this dataset is 180 cm and the smallest is 1 cm.

The outliers that I have identified could correspond to patients who chose to seek chemotherapy to slow or kill the disease (as well as choosing surgery or opting out). Other possibilities include a late diagnosis or a smaller tumor with high cellularity (presenting as more aggressive).
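One common way to flag observations like these is to look at standardized residuals from the fitted model. The sketch below is an assumption about method, not necessarily how the outliers above were found. The data are simulated, one extreme survival time is injected at position 50, and the cutoff of 3 is a rule of thumb.

```r
set.seed(2)

# Simulated data with one injected outlier (not METABRIC).
n <- 100
tumor_size <- runif(n, 1, 60)
overall_survival_months <- 150 - 0.8 * tumor_size + rnorm(n, sd = 20)
overall_survival_months[50] <- 350  # unusually long survival time
sim <- data.frame(tumor_size, overall_survival_months)

fit <- lm(overall_survival_months ~ tumor_size, data = sim)

# Standardized residuals; |value| > 3 is a common rule of thumb, not a fixed standard.
std_res <- rstandard(fit)
flagged <- which(abs(std_res) > 3)

sim[flagged, ]  # rows flagged as potential outliers
```

Flagged rows are candidates for a closer look, as done above, rather than automatic removals.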

Multiple Regression

## 
## Call:
## lm(formula = overall_survival_months ~ tumor_size + tumor_stage + 
##     cellularity + age_at_diagnosis, data = newdat)
## 
## Coefficients:
##         (Intercept)           tumor_size          tumor_stage  
##            240.8551              -0.5426             -26.9790  
##     cellularityHigh       cellularityLow  cellularityModerate  
##             -5.8643             -11.9979               0.4635  
##    age_at_diagnosis  
##             -0.7902

\[ \begin{aligned} \operatorname{\widehat{overall\_survival\_months}} &= 240.86 - 0.54(\operatorname{tumor\_size}) - 26.98(\operatorname{tumor\_stage}) - 5.86(\operatorname{cellularity}_{\operatorname{High}})\ - \\ &\quad 12(\operatorname{cellularity}_{\operatorname{Low}}) + 0.46(\operatorname{cellularity}_{\operatorname{Moderate}}) - 0.79(\operatorname{age\_at\_diagnosis}) \end{aligned} \]
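To show how the fitted equation is used, the sketch below plugs one patient into it. The coefficients are copied from the `lm()` output above. The patient (tumor size 20, stage 2, Moderate cellularity, age 60) is hypothetical and invented for illustration.

```r
# Coefficients copied from the lm() output in the report.
b <- c(intercept = 240.8551, tumor_size = -0.5426, tumor_stage = -26.9790,
       cellularityHigh = -5.8643, cellularityLow = -11.9979,
       cellularityModerate = 0.4635, age_at_diagnosis = -0.7902)

# Hypothetical patient (invented values): tumor size 20, stage 2,
# Moderate cellularity, diagnosed at age 60.
x <- c(intercept = 1, tumor_size = 20, tumor_stage = 2,
       cellularityHigh = 0, cellularityLow = 0,
       cellularityModerate = 1, age_at_diagnosis = 60)

# Predicted survival is the sum of each coefficient times its predictor value.
predicted_months <- sum(b * x[names(b)])
predicted_months
```

For this hypothetical patient the equation gives about 129 months. Only one cellularity indicator is set to 1 at a time, because each patient falls in a single cellularity category.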



Comparing the Simple Linear Model and the Multiple Regression Model

Both the adjusted R-squared (higher for model 2) and the AIC (lower for model 2) suggest that model 2, the multiple regression model, does a better job.


## [1] "Model 1: 0.05"
## [1] "Model 2: 0.1"
          df      AIC
newmodel   3 15085.38
mlr.model  8 15015.18
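The comparison can be reproduced along the following lines. This is a sketch on simulated data, so the numbers will not match the report. The data frame `sim`, the effect sizes, and the names `model1` and `model2` are assumptions, and cellularity is left out to keep the example short.

```r
set.seed(3)

# Simulated predictors and response (not METABRIC).
n <- 300
sim <- data.frame(
  tumor_size       = runif(n, 1, 60),
  tumor_stage      = sample(1:4, n, replace = TRUE),
  age_at_diagnosis = runif(n, 30, 90)
)
sim$overall_survival_months <- 240 - 0.5 * sim$tumor_size -
  27 * sim$tumor_stage - 0.8 * sim$age_at_diagnosis + rnorm(n, sd = 40)

model1 <- lm(overall_survival_months ~ tumor_size, data = sim)
model2 <- lm(overall_survival_months ~ tumor_size + tumor_stage + age_at_diagnosis,
             data = sim)

# Higher adjusted R-squared and lower AIC both favour a model.
adj_r2 <- c(model1 = summary(model1)$adj.r.squared,
            model2 = summary(model2)$adj.r.squared)
round(adj_r2, 2)
AIC(model1, model2)
```

Adjusted R-squared and AIC both penalize extra predictors, so they are better suited to comparing models of different sizes than the plain R-squared.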

Hypothesis Test

Null Hypothesis: the errors follow a normal distribution.

Alternative Hypothesis: the errors do not follow a normal distribution.

The critical threshold is 0.05

Two-sided (Normal Distribution)

##  [1] -1.092699495 -0.100984241  1.597988452  0.471057217  0.005148253
##  [6]  0.085753527  1.303135935 -0.015692062 -0.077813631 -2.158776942
## [11]  0.672000771  0.092052104  1.013243271
## 
##  Anderson-Darling normality test
## 
## data:  rnorm(mlr.model)
## A = 0.36122, p-value = 0.389

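Since the Anderson-Darling test statistic is 0.36122 with an associated p-value of 0.389, which is above the 0.05 threshold, we fail to reject the null hypothesis. However, the call shown tests rnorm(mlr.model), which is a set of 13 random normal draws rather than the residuals of the model. To conclude that it is reasonable to assume the errors have a normal distribution, the test needs to be run on residuals(mlr.model).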
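A sketch of running a normality test on the model residuals themselves is shown below. The data and the model `fit` are simulated stand-ins for `mlr.model`. The Shapiro-Wilk test from base R is used as a substitute so that the example runs without extra packages, and the Anderson-Darling test is run only if the `nortest` package happens to be installed.

```r
set.seed(4)

# Simulated stand-in for the fitted model (not METABRIC).
n <- 200
sim <- data.frame(tumor_size = runif(n, 1, 60))
sim$overall_survival_months <- 150 - 0.8 * sim$tumor_size + rnorm(n, sd = 30)
fit <- lm(overall_survival_months ~ tumor_size, data = sim)

# One residual per observation; this is what the normality test should receive.
res <- residuals(fit)

# Shapiro-Wilk test from the stats package, used here as a stand-in.
sw <- shapiro.test(res)
sw$p.value

# Anderson-Darling test, as in the report, if the nortest package is available.
if (requireNamespace("nortest", quietly = TRUE)) {
  print(nortest::ad.test(res))
}
```

With either test, a p-value above 0.05 means the null hypothesis of normally distributed errors is not rejected at that threshold.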

Summary and Reflection

Given what I know about regression, I believe I have done a reasonable job with the model. One part I would want to improve is the exploratory data analysis of the variables I decided not to focus on in these two models. I would also like to merge in other data sets of breast cancer patient information from another population, such as women from a cancer center in the United States or another region.

In summary, my population of women patients from the British Columbia Cancer Centre in Canada includes a large number of women affected by a disease that is still taking thousands of lives each year. There are also many women who have continued to live after diagnosis, as well as patients currently undergoing treatments such as chemotherapy to fight it. This dataset and my findings show an association between how long a breast cancer patient lives and the age at which the cancer is first caught, the level of cellularity in the breast tissue, and the size and stage of the tumor, although the multiple regression model has an adjusted R-squared of only 0.1.

When these factors are taken into account, they allow a judgment call to be made between doctor and patient. The model gives an estimate of how long a patient is expected to live (in months). This can be the deciding factor between a fighting chance undergoing chemotherapy and peaceful last days with family members that won’t leave your family in debt.